Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation
The ability to perform effective planning is crucial for building an instruction-following agent. When navigating through a new environment, an agent is challenged with (1) connecting the natural language instructions with its progressively growing knowledge of the world; and (2) performing long-range planning and decision making in the form of effective exploration and error correction. Current methods are still limited on both fronts despite extensive efforts. In this paper, we introduce the Evolving Graphical Planner (EGP), a module that allows global planning for navigation based on raw sensory input. The module dynamically constructs a graphical representation, generalizes the local action space to allow for more flexible decision making, and performs efficient planning on a proxy representation. We demonstrate our model on a challenging Vision-and-Language Navigation (VLN) task with photorealistic images, and achieve superior performance compared to previous navigation architectures. Concretely, we achieve a 53% success rate on the test split of the Room-to-Room navigation task (Anderson et al.) through pure imitation learning, outperforming previous architectures by up to 5%.
Review for NeurIPS paper: Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation
Additional Feedback:
148: "ranked by confidence scores in policy distribution": are these nodes with the highest probability under the current policy \pi?
163: I was unclear on what metric M is used.
I think it would be helpful to give some more motivation for the proxy graph. Is the proxy graph necessary because the underlying navigation graphs are too large or too densely connected to run a GNN on? Or does the proxy graph improve performance on the metrics?
Can "mp" in Table 3 be defined in terms of the K_p or K_L defined in section 4.2?
--- update after author response ---
Thank you for all your helpful clarifications! They largely address my concerns, and I've updated my score to a 6.
Review for NeurIPS paper: Evolving Graphical Planner: Contextual Global Planning for Vision-and-Language Navigation
This paper addresses the problem of vision-and-language navigation from raw visual input and language instructions in a photorealistic indoor environment (Room-to-Room) by iteratively building a high-level graph representation and then performing goal-driven planning on it with Graph Neural Networks. Instead of planning on the full graph, the model predicts actions over the fringe nodes of that graph (i.e., it jumps through the graph along shortest paths), and it also predicts and plans on a sparser proxy graph representation (these are novel ideas). It is trained using imitation learning. After the discussion and the authors' rebuttal, the reviewers' scores are (6, 7, 7, 6). While many of the reviewers' concerns are addressed, the main remaining concerns are: a missing comparison to graph search methods (specifically "Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation"); confusion about the word "planning" in an imitation learning setting; the discussion of how loop closure is performed; the need to acknowledge the competitive advantage of knowing which nodes of the graph are frontier nodes; and scarce information about how to reproduce the work.
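To make the idea of acting over fringe nodes concrete, here is a minimal Python sketch. It is not the authors' implementation: the class `EvolvingGraph`, its method names, and the toy node labels are all hypothetical, and the learned policy that scores fringe nodes (as well as the proxy graph) is left out. It only illustrates a graph that grows with each observation, an action space made of observed-but-unvisited nodes, and a jump to a chosen node along a shortest path.

```python
from collections import deque


class EvolvingGraph:
    """Toy navigation graph that grows as the agent explores (hypothetical names)."""

    def __init__(self, start):
        self.adj = {start: set()}  # node -> neighbours observed so far
        self.visited = {start}     # nodes the agent has actually stood on
        self.current = start

    def observe(self, neighbours):
        """Add the nodes visible from the current position to the graph."""
        for n in neighbours:
            self.adj.setdefault(n, set()).add(self.current)
            self.adj[self.current].add(n)

    def frontier(self):
        """Observed-but-unvisited nodes: the candidate global actions."""
        return sorted(set(self.adj) - self.visited)

    def shortest_path(self, goal):
        """Breadth-first search from the current node, expanding visited nodes only."""
        parent = {self.current: None}
        queue = deque([self.current])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            if node not in self.visited:
                continue  # outgoing edges of unvisited nodes are unknown
            for nxt in sorted(self.adj[node]):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None

    def jump(self, goal):
        """Move to a chosen frontier node and return the path that was followed."""
        path = self.shortest_path(goal)
        if path is None:
            raise ValueError(f"{goal!r} is not reachable from {self.current!r}")
        self.current = goal
        self.visited.add(goal)
        return path


g = EvolvingGraph("a")
g.observe(["b", "c"])     # from "a" the agent sees "b" and "c"
first = g.jump("b")       # a local step to a neighbouring node
g.observe(["d"])          # from "b" the agent sees "d"
second = g.jump("c")      # a global action: backtrack through "a" to reach "c"
```

In this toy run the second jump follows the path b, a, c, which is the kind of error-correcting backtrack that a purely local action space cannot express in a single decision. In the paper's setting the choice of frontier node would come from the learned policy rather than being hard-coded.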